Motivation

Provide an overview of the project goals and motivation

The motivation for a project exploring the correlation between low birth weight, premature birth rate, cancer mortality rate, cardiovascular disease hospitalization rate, asthma hospitalization rate and PM2.5 concentration, as well as socioeconomic, racial, education, and gender factors across the 62 counties of New York State, stems from a compelling public health imperative. All the outcomes of interest are critical indicators of population health. By examining the impact of fine particulate pollution, this project aims to uncover environmental health disparities that disproportionately affect under-served communities. It seeks to inform public health policy by highlighting the need for targeted interventions that could mitigate the adverse effects of air pollution on vulnerable populations, including pregnant women and newborns. The analysis of socioeconomic and racial factors will offer insights into the complex interplay between environment and social determinants of health, potentially guiding urban planning and healthcare resource allocation. Furthermore, by considering gender-specific vulnerabilities, the project aims to provide a comprehensive overview of the risk landscape, fostering community awareness and prompting action to improve air quality and health equity in one of the most densely populated urban areas in the world.

Initial Questions: KEVIN

What questions are you trying to answer? How did these questions evolve over the course of the project? What new questions did you consider in the course of your analysis?

Data

Source, scraping method, cleaning, etc.

Data Import

Data Wrangling

PM2.5 Dataset

pm2_5 <- pm2_5 %>% 
  janitor::clean_names()%>%
  select (county,value) %>%
  rename (annual_pm2.5 = "value")

The dataset is obtained from EPH Tracking Website from CDC (https://ephtracking.cdc.gov/DataExplorer/). This has 62 rows and 2 columns of data. In which, 2 variables are:
- county: NY county name
- annual_pm2.5: annual estimated PM2.5 concentration at each county (ug/m^3)

Demographic Dataset

The following set of demographic datasets is obtained from Census Reporter webpage (https://censusreporter.org/).

Education Attainment

edu <- edu_NY %>% 
  janitor::clean_names()%>%
  mutate (percentage_high_education = (bach_male+master_male+prof_male+doct_male+bach_female+master_female+prof_female+doct_female)/total) %>%
  filter(str_detect (name, " County, NY")) %>% 
  mutate(county = str_replace (name, " County, NY", "")) %>%
  select(county,percentage_high_education) %>%
  mutate(county = str_replace (county, "St.", "St")) %>%
  mutate(county = str_replace (county, "Stuben", "Steuben"))%>%
  select(county,percentage_high_education)

This has 62 rows and 2 columns of data. In which, 2 variables are:
- county: NY county name
- percentage_high_education: percentage of population who finished a higher education level (higher than a bachelor degree)

Ethnicity

race_non_hisp_white <- race_NY %>% 
  janitor::clean_names()%>%
  filter( x1 == "White Non-Hispanic") %>% 
  pivot_longer(
    albany_county_ny : yates_county_ny,
    names_to = "county",
    values_to = "percent_non_hisp_white") %>% 
  select (county,percent_non_hisp_white) %>%
  separate(county, into = c("county", "x"), sep = "_county_") %>%
  select (county,percent_non_hisp_white ) %>%
  mutate(county = str_replace (county, "_", " ")) %>%
  mutate_at(vars(county), str_to_title)

race_non_hisp_black <- race_NY %>% 
  janitor::clean_names()%>%
  filter( x1 == "Black Non-Hispanic") %>% 
  pivot_longer(
    albany_county_ny : yates_county_ny,
    names_to = "county",
    values_to = "percent_non_hisp_black") %>% 
  select (county,percent_non_hisp_black) %>%
  separate(county, into = c("county", "x"), sep = "_county_") %>%
  select (county,percent_non_hisp_black ) %>%
  mutate(county = str_replace (county, "_", " ")) %>%
  mutate_at(vars(county), str_to_title)

race_hisp_white <- race_NY %>% 
  janitor::clean_names()%>%
  filter( x1 == "White Hispanic") %>% 
  pivot_longer(
    albany_county_ny : yates_county_ny,
    names_to = "county",
    values_to = "percent_hisp_white") %>% 
  select (county,percent_hisp_white) %>%
  separate(county, into = c("county", "x"), sep = "_county_") %>%
  select (county,percent_hisp_white ) %>%
  mutate(county = str_replace (county, "_", " ")) %>%
  mutate_at(vars(county), str_to_title)

race_hisp_black <- race_NY %>% 
  janitor::clean_names()%>%
  filter( x1 == "Black Hispanic") %>% 
  pivot_longer(
    albany_county_ny : yates_county_ny,
    names_to = "county",
    values_to = "percent_hisp_black") %>% 
  select (county,percent_hisp_black) %>%
  separate(county, into = c("county", "x"), sep = "_county_") %>%
  select (county,percent_hisp_black ) %>%
  mutate(county = str_replace (county, "_", " ")) %>%
  mutate_at(vars(county), str_to_title)

race_merge <- race_non_hisp_white %>%
  inner_join(race_non_hisp_black, by = "county") %>%
  inner_join(race_hisp_white, by = "county") %>%
  inner_join(race_hisp_black, by = "county")

race_merge <- race_merge %>% 
  mutate (percent_other = 1 - percent_non_hisp_white - percent_non_hisp_black - percent_hisp_white - percent_hisp_black)

This has 62 rows and 6 columns of data. In which, 6 variables are:
- county: NY county name
- percent_non_hisp_white: percentage of population who are identified as Non-Hispanic White
- percent_non_hisp_black: percentage of population who are identified as Non-Hispanic Black
- percent_hisp_white: percentage of population who are identified as Hispanic White
- percent_hisp_black: percentage of population who are identified as Hispanic Black
- percent_other: percentage of population who are identified as any other ethinic groups

Household Income

income <- HHincome_NY %>%
  janitor::clean_names() %>% 
  filter (x1 == "percent_high_income") %>%
   pivot_longer(
    albany_county_ny : yates_county_ny,
    names_to = "county",
    values_to = "percent_high_income") %>%
  select (county,percent_high_income) %>%
  separate(county, into = c("county", "x"), sep = "_county_") %>%
  select (county,percent_high_income ) %>%
  mutate(county = str_replace (county, "_", " ")) %>%
  mutate_at(vars(county), str_to_title)

This has 62 rows and 2 columns of data. In which, 2 variables are:
- county: NY county name
- percent_high_income: percentage of population who are at higher income households (>$75,000 annually)

Median Age

age <- Age_NY %>% 
  janitor::clean_names() %>% 
  pivot_longer(
    albany_county_ny : yates_county_ny,
    names_to = "county",
    values_to = "median_age") %>%
  select (county,median_age) %>%
  separate(county, into = c("county", "x"), sep = "_county_") %>%
  select (county,median_age ) %>%
  mutate(county = str_replace (county, "_", " ")) %>%
  mutate_at(vars(county), str_to_title)

This has 62 rows and 2 columns of data. In which, 2 variables are:
- county: NY county name
- median_age: median age of the population at each county

Sex

sex <- Sex_NY %>% 
  janitor::clean_names() %>% 
  filter (x1 == "Male:") %>%
  pivot_longer(
    albany_county_ny : yates_county_ny,
    names_to = "county",
    values_to = "percent_male") %>%
  select (county,percent_male) %>%
  separate(county, into = c("county", "x"), sep = "_county_") %>%
  select (county,percent_male ) %>%
  mutate(county = str_replace (county, "_", " ")) %>%
  mutate_at(vars(county), str_to_title)

This has 62 rows and 2 columns of data. In which, 2 variables are:
- county: NY county name
- percent_male: percentage of population who are identified as male

Low Birthweight Data

lowbirthweight <- lowbirthweight %>% 
  janitor::clean_names()%>%
  select (region_county, percentage)%>%
  rename (county = "region_county", percent_lowbirthweight = "percentage") 

This has 87 rows and 2 columns of data. In which, 2 variables are:
- county: NY county name
- percent_lowbirthweight: percentage of children being born being identified as low birth weight (<2,500g) (https://www.health.ny.gov/)

Health Indicator Dataset

The following dataset is obtained from New York State Department of Health (https://www.health.ny.gov/) that contains 4 different health indicators that will complement with the low birthweight. These 5 will act as our outcomes of interest for our regression models.

health <- health %>%
  janitor::clean_names()%>%
  select (geographic_area,indicator_title,topic_area,rate_percent,measurement) %>%
  filter( str_detect (geographic_area, " County")) %>% 
  mutate (county = str_replace (geographic_area, " County", "")) %>%
  select (county, everything()) %>%
  select (-geographic_area) %>%
  filter(topic_area == "Cancer Indicators" | topic_area == "Respiratory Disease Indicators" | topic_area == "Cardiovascular Disease Indicators" | topic_area == "Maternal and Infant Health Indicators")
  
cancer <- health %>% 
  filter (topic_area == "Cancer Indicators") %>% 
  filter (indicator_title == "All cancer incidence rate per 100,000") %>% 
  select (-c(indicator_title, topic_area, measurement)) %>%
  mutate(rate_percent = as.numeric(rate_percent)/10) %>%
  rename (cancer_mortality_per_10k = "rate_percent")

resp <- health %>%
  filter (topic_area == "Respiratory Disease Indicators") %>%
  filter (indicator_title == "Asthma hospitalization rate per 10,000") %>% 
  select (-c(indicator_title, topic_area, measurement)) %>% 
  rename (asthma_hosp_rate_per_10k = "rate_percent")

cardio <- health %>%
  filter (topic_area == "Cardiovascular Disease Indicators") %>%
  filter (indicator_title == "Cardiovascular disease hospitalization rate per 10,000") %>% 
  select (-c(indicator_title, topic_area, measurement)) %>% 
  rename (cardio_hosp_rate_per_10k = "rate_percent")

maternal <- health %>%
  filter (topic_area == "Maternal and Infant Health Indicators") %>%
  filter (indicator_title == "Percentage of premature births with <37 weeks gestation") %>% 
  select (-c(indicator_title, topic_area, measurement)) %>% 
  rename (premature_percentage = "rate_percent")

They are:
- cancer_mortality_per_10k: percentage of cancer mortality per 100 thousands people in each NY county (Cancer Indicator)
- asthma_hosp_rate_per_10k: percentage of asthma hospitalization per 10 thousands people in each NY county (Respiratory Disease Indicator)
- cardio_hosp_rate_per_10k“: percentage of cardiovascular-disease-related hospitalization per 10 thousands people in each NY county (Cardiovascular Disease Indicator)
- premature_percentage: percentage of children being born prematurely (<37 gestational weeks) in each NY county

Merge dataset

Here we perform inner_join() to create 2 bigger datasets health_merge and demographic_merge. Then, we join them with our lowbirthweight & pm2_5 tp make a finalized data frame called merge. And, we will use this for regression model.

health_merge <- maternal %>% 
  inner_join(cancer, by = "county") %>%
  inner_join(resp, by = "county") %>% 
  inner_join(cardio, by = "county") 

demographic_merge <- age %>%
  inner_join(sex,  by = "county") %>%
  inner_join(income, by = "county") %>% 
  inner_join(race_merge, by = "county") %>% 
  inner_join(edu, by = "county") %>%
  mutate(county = str_replace (county, "St ", "St. "))
  

merge <- lowbirthweight %>% 
  inner_join(pm2_5, by = "county") %>%
  inner_join(health_merge, by = "county") %>% 
  inner_join(demographic_merge, by = "county") 

merge <- merge %>% 
  select (county, annual_pm2.5,everything())%>%
  mutate(asthma_hosp_rate_per_10k = as.numeric(asthma_hosp_rate_per_10k))%>%
  mutate(cardio_hosp_rate_per_10k = as.numeric(cardio_hosp_rate_per_10k)) %>%
  mutate(premature_percentage = as.numeric(premature_percentage))%>%
  mutate(cancer_mortality_per_10k = as.numeric(cancer_mortality_per_10k))

Making Map File

uscounties <- uscounties %>% 
  filter (state_id == "NY") %>%
  select (county, lat, lng)

map <- merge %>% 
  inner_join(uscounties, by ="county")

write.csv(map, "NY_map.csv", row.names = FALSE)

This file is specifically made for the purpose of making map (in Shiny App).

Exploratory Data Analysis (EDA)

Visualizations, summaries, and exploratory statistical analyses. Justify the steps you took, and show any major changes to your ideas.

Summary

Percentage of High Income

merge %>%
 plot_ly(
    x = ~reorder(county, percent_high_income),
    y = ~percent_high_income,
    type = "bar",
    marker = list(color = "red1")
  ) %>%
  layout(
    title = "Percentage of High Income",
    xaxis = list(title = "County Name", categoryorder = "total descending"),
    yaxis = list(title = "Percentage"),
    barmode = "stack"
  )
Racial Composition
race_plot <- merge %>% 
  select (county, percent_non_hisp_white, percent_non_hisp_black, percent_hisp_white, percent_hisp_black, percent_other) %>%
  pivot_longer(
    cols = starts_with("percent_"), names_to = "race", values_to = "percentage") 

race_plot%>%
  plot_ly(x = ~county, y = ~percentage, type = "bar",color = ~race,colors = "RdYlGn", hoverinfo = "y+name") %>% 
  layout(barmode = "stack",
         title = "Racial Composition in NY county",
         xaxis = list(title = "County"),
         yaxis = list(title = "Percentage (%)"))
Outcome Graphs
Low birthweight Rate
LBR_graph <- merge %>%
  drop_na() %>%
  plot_ly(
    x = ~county,
    y = ~percent_lowbirthweight,
    color = ~county,
    type = "scatter",
    mode = "markers"
  ) %>%
  layout(
    title = "Rate of Low Birthweight",
    xaxis = list(title = "County", tickangle = 90),
    yaxis = list(title = "Percentage of Low Birthweight"),
    showlegend = FALSE
  )
Premature Birth
premature_graph <- merge %>%
  drop_na() %>%
  plot_ly(
    x = ~county,
    y = ~premature_percentage,
    color = ~county,
    type = "scatter",
    mode = "markers"
  ) %>%
  layout(
    title = "Rate of Premature Birth",
    xaxis = list(title = "County", tickangle = 90),
    yaxis = list(title = "Rate of Premature Birth"),
    showlegend = FALSE
  )
Asthma Hospitalization
asthma_graph <- merge %>%
  drop_na() %>%
  plot_ly(
    x = ~county,
    y = ~asthma_hosp_rate_per_10k,
    color = ~county,
    type = "scatter",
    mode = "markers"
  ) %>%
  layout(
    title = "Rate of Asthma Hospitalization",
    xaxis = list(title = "County", tickangle = 90),
    yaxis = list(title = "Rate per 10k"),
    showlegend = FALSE
  )
Cancer Mortality Rate
cancer_graph <- merge %>%
  drop_na() %>%
  plot_ly(
    x = ~county,
    y = ~cancer_mortality_per_10k,
    color = ~county,
    type = "scatter",
    mode = "markers"
  ) %>%
  layout(
    title = "Rate of Cancer Mortality",
    xaxis = list(title = "County", tickangle = 90),
    yaxis = list(title = "Rate per 100k"),
    showlegend = FALSE
  )
Cardiovascular Rate
cardio_graph <- merge %>%
  drop_na() %>%
  plot_ly(
    x = ~county,
    y = ~cardio_hosp_rate_per_10k,
    color = ~county,
    type = "scatter",
    mode = "markers"
  ) %>%
  layout(
    title = "Rate of Cardiovascular Disease Hospitalization",
    xaxis = list(title = "County", tickangle = 90),
    yaxis = list(title = "Rate per 10k"),
    showlegend = FALSE
  )

Combine Plot.ly

fig <-  plot_ly()

fig <- fig %>% add_trace(x = ~merge$county, y = ~merge$percent_lowbirthweight, type = "scatter", mode = "markers", name = "Low Birthweight Rate")
fig <- fig %>% add_trace(x = ~merge$county, y = ~merge$premature_percentage, type = "scatter", mode = "markers", name = "Premature Birth Rate")
fig <- fig %>% add_trace(x = ~merge$county, y = ~merge$asthma_hosp_rate_per_10k, type = "scatter", mode = "markers", name = "Asthma Hospitalization Rate")
fig <- fig %>% add_trace(x = ~merge$county, y = ~merge$cancer_mortality_per_10k, type = "scatter", mode = "markers", name = "Cancer Mortality Rate")
fig <- fig %>% add_trace(x = ~merge$county, y = ~merge$cardio_hosp_rate_per_10k, type = "scatter", mode = "markers", name = "Cardiovascular Disease Hospitalization Rate")

fig <- fig %>% layout(
  title = "Overlay of 5 Outcomes",
  xaxis = list(title = "County", tickangle = 90),
  yaxis = list(title = "Rate"),
  showlegend = TRUE
)
fig

Statistical Analysis

If you undertake formal statistical analyses, describe these in detail

After cleaning the data, we then look at the effect that the annual concentration of pm 2.5 separately has on multiple outcomes among 62 counties in New York, including low birth weight rate, premature birth rate, cancer mortality rate, asthma-related hospitalization rate, and cardiovascular-disease-related hospitalization rate.

Since all of the outcomes are measured as rates (i.e., continuous), we use linear regression models to assess the effect of annual concentration of pm 2.5 on the outcomes of interests, adjusting for age, sex, ethnicity, education, income. In our models:

For each of the five outcomes of interest, we first start with the full model:

Outcome = Annual pm 2.5 concentration + median age + percentage of male + percentage of Hispanic-Black + percentage of Hispanic-White + percentage of non-Hispanic-Black + percentage of non-Hispanic-White + percentage of household income exceeding 75,000 USD + percentage of people obtained higher education

We then perform bidirectional stepwise selection on all five models and output the best models. The full model and the resulting best models after stepwise selection for each outcome are detailed below:

Low Birth Weight Rate vs. Annual PM2.5 Concentration

Linear regression
lm_pm2.5_birthweight_adjusted <-lm(percent_lowbirthweight ~ annual_pm2.5 + median_age + percent_high_income + percent_non_hisp_white + percent_non_hisp_black + percent_hisp_white + percent_hisp_black + percentage_high_education + percent_male, data=merge)
Stepwise selection
lm_pm2.5_birthweight_adjusted_best <- step(lm_pm2.5_birthweight_adjusted, direction = 'both', trace=FALSE)

summary(lm_pm2.5_birthweight_adjusted_best)%>%
  tab_model()
  percent lowbirthweight
Predictors Estimates CI p
(Intercept) 11.53 7.18 – 15.88 <0.001
median age -0.11 -0.21 – -0.02 0.024
percent non hisp black 8.05 1.34 – 14.76 0.019
Observations 62
R2 / R2 adjusted 0.274 / 0.249

Interpretation

The best model that was outputted by the stepwise regression is

$ = 11.53 - 0.11() + 8.05 ()$

On average, for counties whose median age is 0 and has zero percent of non-Hispanic-Black, the expected low birth weight rate is 11.5% (no practical meaning). Since the p-value is smaller than 0.05, we have sufficient evidence to claim that the intercept is significantly different than 0.

On average, for every unit increase of median age (1 year), the expected low birth weight rate of the county decreases by 0.11%. Since the p-value is smaller than 0.05, we have sufficient evidence to claim that the beta coefficient for median age is significantly different than 0.

On average, for every unit increase in the percentage of non-Hispanic-Black, the expected low birth weight rate of the county increases by 8.05%. Since the p-value is smaller than 0.05, we have sufficient evidence to claim that the beta coefficient for median age is significantly different than 0.

The adjusted R-squared of 0.249 implies that 24.9% of the variation in the response variable can be explained by its linear relationship with the set of the 2 predictors (median age and percentage of non-Hispanic-Black).

Premature Birth Weight Rate vs. Annual pm 2.5 Concentration

Linear regression
lm_pm2.5_premature_adjusted <-lm(premature_percentage ~ annual_pm2.5 + median_age + percent_high_income + percent_non_hisp_white + percent_non_hisp_black + percent_hisp_white + percent_hisp_black + percentage_high_education + percent_male, data=merge)
Stepwise selection
lm_pm2.5_premature_adjusted_best <- step(lm_pm2.5_premature_adjusted, direction = 'both', trace=FALSE)

summary(lm_pm2.5_premature_adjusted_best)%>%
  tab_model()
  premature percentage
Predictors Estimates CI p
(Intercept) 4.88 0.96 – 8.80 0.016
median age 0.08 -0.01 – 0.17 0.067
percent non hisp black 8.03 2.63 – 13.42 0.004
Observations 61
R2 / R2 adjusted 0.135 / 0.105

Interpretation

The best model that was outputted by the stepwise regression is

$ = 4.88 + 0.08() + 8.03 ()$

On average, for counties whose median age is 0 and has zero percent of non-Hispanic-Black, the expected low birth weight rate is 4.88% (no practical meaning). Since the p-value is smaller than 0.05, we have sufficient evidence to claim that the intercept is significantly different than 0.

On average, for every unit increase of median age (1 year), the expected low birth weight rate of the county decreases by 0.11%. However, since the p-value is larger than 0.05, we do not have sufficient evidence to claim that the beta coefficient for median age is significantly different than 0.

On average, for every unit increase in the percentage of non-Hispanic-Black, the expected low birth weight rate of the county increases by 8.03%. Since the p-value is smaller than 0.05, we have sufficient evidence to claim that the beta coefficient for median age is significantly different than 0.

The adjusted R-squared of 0.105 implies that 10.5% of the variation in the response variable can be explained by its linear relationship with the set of the 2 predictors (median age and percentage of non-Hispanic-Black).

Cancer Mortality Rate vs. Annual pm 2.5 Concentration

Linear regression
lm_pm2.5_cancer_adjusted <-lm(cancer_mortality_per_10k ~annual_pm2.5 + median_age + percent_high_income + percent_non_hisp_white + percent_non_hisp_black + percent_hisp_white + percent_hisp_black + percentage_high_education + percent_male, data=merge)
Stepwise selection
lm_pm2.5_cancer_adjusted_best<-step(lm_pm2.5_cancer_adjusted, direction = 'both', trace=FALSE)

summary (lm_pm2.5_cancer_adjusted_best)%>%
  tab_model()
  cancer mortality per 10 k
Predictors Estimates CI p
(Intercept) -0.18 -51.73 – 51.36 0.994
annual pm2 5 1.33 0.10 – 2.55 0.035
median age 1.31 0.98 – 1.65 <0.001
percent high income -21.35 -37.32 – -5.38 0.010
percent non hisp white 46.40 20.09 – 72.71 0.001
percent non hisp black 53.63 2.27 – 104.99 0.041
percent hisp white 102.38 10.35 – 194.41 0.030
percent hisp black -178.74 -417.28 – 59.79 0.139
percent male -57.41 -140.92 – 26.09 0.174
Observations 61
R2 / R2 adjusted 0.809 / 0.780

Interpretation

The best model that was outputted by the stepwise regression is

$ = -1.85 + 13.27() + 13.14() - 213.49() + 464.03() + 536.28() + 1023.82() - 1787.43() - 574.15()$

On average, for counties whose annual pm 2.5 concentration is 0, median age is 0, has zero percent of non-Hispanic-Black, has zero percent of non-Hispanic-White, has zero percent of Hispanic-Black, has zero percent of Hispanic-White, has zero percent of male, the expected low birth weight rate is -1.85% (no practical meaning). However, since the p-value is larger than 0.05, we do not have sufficient evidence to claim that the intercept is significantly different than 0.

On average, for every unit increase of median age (1 year), the expected low birth weight rate of the county decreases by 0.11%. However, since the p-value is larger than 0.05, we do not have sufficient evidence to claim that the beta coefficient for median age is significantly different than 0.

On average, for every unit increase in the percentage of non-Hispanic-Black, the expected low birth weight rate of the county increases by 8.03%. Since the p-value is smaller than 0.05, we have sufficient evidence to claim that the beta coefficient for median age is significantly different than 0.

The adjusted R-squared of 0.105 implies that 10.5% of the variation in the response variable can be explained by its linear relationship with the set of the 2 predictors (median age and percentage of non-Hispanic-Black).

Asthma Hospitalization rate vs. Annual pm 2.5 Concentration
Linear regression
lm_pm2.5_asthma_adjusted <-lm(asthma_hosp_rate_per_10k ~ annual_pm2.5 + median_age + percent_high_income + percent_non_hisp_white + percent_non_hisp_black + percent_hisp_white + percent_hisp_black + percentage_high_education + percent_male, data=merge)
Stepwise selection
lm_pm2.5_asthma_adjusted_best <- step(lm_pm2.5_asthma_adjusted, direction = 'both', trace=FALSE)

summary(lm_pm2.5_asthma_adjusted_best)%>%
  tab_model()
  asthma hosp rate per 10 k
Predictors Estimates CI p
(Intercept) 1.36 0.06 – 2.67 0.041
percent high income 4.65 0.32 – 8.99 0.036
percent non hisp black 13.52 5.77 – 21.26 0.001
percent hisp black 366.67 314.84 – 418.50 <0.001
percentage high education -5.31 -9.55 – -1.06 0.015
Observations 59
R2 / R2 adjusted 0.935 / 0.930
Cardiovascular Disease rate vs. Annual pm 2.5 Concentration
Linear regression
lm_pm2.5_cardio_adjusted <-lm(cardio_hosp_rate_per_10k ~annual_pm2.5 + median_age + percent_high_income + percent_non_hisp_white + percent_non_hisp_black + percent_hisp_white + percent_hisp_black + percentage_high_education + percent_male, data=merge)
Stepwise selection
lm_pm2.5_cardio_adjusted_best<-step(lm_pm2.5_cardio_adjusted, direction = 'both', trace=FALSE)

summary (lm_pm2.5_cardio_adjusted_best)%>%
  tab_model()
  cardio hosp rate per 10 k
Predictors Estimates CI p
(Intercept) 157.52 -155.87 – 470.91 0.318
annual pm2 5 6.97 -0.14 – 14.08 0.055
median age 1.73 0.01 – 3.44 0.049
percent non hisp white 117.26 -30.46 – 264.97 0.117
percent non hisp black 222.99 -83.73 – 529.70 0.151
percent hisp white 439.56 -49.52 – 928.63 0.077
percentage high education -148.61 -224.74 – -72.49 <0.001
percent male -405.53 -920.48 – 109.43 0.120
Observations 62
R2 / R2 adjusted 0.329 / 0.242

Discussion: KEVIN

What were your findings? Are they what you expect? What insights into the data can you make?